While Moore’s Law theoretically doubles processor performance every 24 months, much of the
realizable  performance  remains  untapped  because  the  burden  falls  to  the  (less  informed)  domain 
scientist  or  engineer  to  exploit  parallel  hardware  for  performance  gains.  Even  when  such 
untapped  hardware  potential  is  fully  realized,  it  is  often  not  coupled  with  advances  in  algorithmic 
innovation,  which  can  deliver  further  (multiplicative)  speed-up  beyond  Moore’s  Law,  as  noted  in 
the  AFOSR  BAA.  For  example,  in  a  heterogeneous  system  containing  a  CPU  and  GPU,  a 
straightforward  1600-core  GPU  parallelization  of  a  CPU-based  n-body  code  for  molecular 
modeling resulted in only an 88.4-fold speed-up over a serial, but SSE-vectorized, CPU code.
Applying architecture-aware GPU optimizations extracted an additional 4.2-fold speed-up, for a
total of 371-fold. By also leveraging algorithmic innovation via a hierarchical charge
partitioning algorithm, we delivered an additional 216-fold speed-up, resulting in a multiplicative
speed-up of roughly 80,000-fold.
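
The way these factors compose can be checked with a few lines of arithmetic. The sketch below uses only the three factors quoted above; the variable names are illustrative and not taken from the project code.

```python
# Speed-up factors quoted in the text (variable names are illustrative).
naive_gpu = 88.4      # straightforward GPU port vs. serial, SSE-vectorized CPU code
arch_aware = 4.2      # additional gain from architecture-aware GPU optimizations
algorithmic = 216.0   # additional gain from hierarchical charge partitioning

# Independent improvements multiply rather than add.
hardware_total = naive_gpu * arch_aware
overall = hardware_total * algorithmic

print(f"hardware-only speed-up: {hardware_total:.0f}x")
print(f"overall speed-up:       {overall:.0f}x")
```

The product of the hardware-level factors is about 371, and multiplying by the algorithmic factor gives a value slightly above 80,000, consistent with the rounded figures reported above.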

Therefore,  in  this  project,  we  formalized  the  aforementioned  co-design  process  for  the  n-body 
computational  motif  and  adapted  and  applied  it  to  the  structured  grid  and  unstructured  grid 
motifs  found  in  computational  fluid  dynamics  (CFD)  in  support  of  aerodynamic  predictions  for 
micro-air  vehicles  (MAVs).  While  many  past  efforts  to  develop  such  CFD  codes  on  accelerated 
processors  showed  limited  success,  our  hardware/software  co-design  approach  created  malleable 
algorithms  that  could  be  mapped  and  optimized  onto  the  right  type  of  processing  core  at  the  right 
time,  and  in  turn,  deliver  an  order-of-magnitude  better  performance  than  would  have  otherwise 
been  possible  by  Moore’s  Law  alone.  To  further  enhance  our  co-design  process,  we  engaged 
hardware  vendors  to  support  our  effort,  and  in  turn,  our  research  has  assisted  in  guiding  their 
future  hardware  design,  e.g.,  our  GPU-accelerated  HokieSpeed  supercomputer. 


Overview 

Many  past  efforts  to  develop  computational  fluid  dynamics  (CFD)  codes  in  heterogeneous 
computing systems that consist of accelerated processors, such as graphics processing units
(GPUs),  have  demonstrated  limited  success.  This  predicament  was  due  in  part  to  the  relatively 
naive  mapping  of  traditional  CPU-based  algorithms  onto  accelerated  processors  like  the  GPU 
rather than a synergistic hardware/software co-design approach that (1) refactors the CPU-based
CFD  algorithms  and  co-designs  them  to  accelerated  processors  or  (2)  starts  from  first  principles 
that  underlie  CFD,  e.g.,  the  fundamental  computational  motifs  of  structured/unstructured  grids, 
and  co-designs  them  in  the  context  of  these  accelerated  processors. 

To  achieve  the  desired  metric(s)  of  success  with  respect  to  speed,  programmability,  portability, 
power  consumption,  energy  efficiency,  and  combinations  thereof,  the  different  combinations  of 
algorithm, software, and hardware need to be judiciously co-designed. In formalizing the
co-design approach that we used for molecular modeling and applying it to CFD, we researched and
developed  CFD  algorithms  and  supporting  hardware  for  micro-air  vehicle  (MAV)  simulations 
that  achieved  substantial  speed-up  over  current  simulations  and  provided  significantly  better 
hardware  utilization.  Such  an  increase  in  performance,  realized  via  our  co-design  approach, 
allowed  large  eddy  simulation  (LES)  calculations  on  100-million  element  meshes  to  become 
routine.  However,  significant  challenges  existed  in  mapping  specific  codes  designed  for  such 
simulations  in  heterogeneous  computing  environments,  consisting  of  CPUs  and  GPUs  (and  very 
soon,  accelerated  processing  units  or  APUs  that  combine  the  CPU  and  GPU  onto  the  same 
processor  die). 

The  high-performance  simulation  of  MAVs  required  a  number  of  performance-related  choices  at 
the  method  and  algorithm  levels  that  were  co-designed  with  the  underlying  systems  software 
(e.g.,  parallel  libraries  and  run-time  system)  and  hardware.  In  turn,  system  software  and 
hardware  choices  needed  to  optimally  support  the  algorithmic  requirements  of  MAV  simulation. 
As an example, consider a simulation that requires the solution of the unsteady
Reynolds-Averaged Navier-Stokes (RANS) equations using the pressure projection approach, where we
have  a  choice  between  structured  or  unstructured  finite-element  meshes  and  explicit  or  implicit 
time  steps.  For  this  scenario,  we  used  a  mesh-partitioning  algorithm  to  provide  a  suitable  domain 
decomposition.  The  subdomains  were  then  distributed  over  the  nodes,  such  that  the  mapping 
respected  proximity  of  adjoining  subdomains  and  with  multiple  subdomains  per  node,  allowing 
subdomains  to  be  reassigned  dynamically  for  load  balancing. 
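
The mapping step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the partitioner used in the project: subdomains are assumed to be ordered along a one-dimensional chain so that list neighbours are spatial neighbours, each subdomain carries a hypothetical cost weight, and the function names are invented for this example. Nodes receive contiguous runs of subdomains (preserving proximity), and a rebalancing step moves one boundary subdomain to a neighbouring node when that lowers the peak load.

```python
def assign_contiguous(weights, num_nodes):
    """Give each node a contiguous run of subdomains of roughly equal total weight."""
    target = sum(weights) / num_nodes
    owner, prefix = [], 0.0
    for w in weights:
        midpoint = prefix + 0.5 * w          # position of this subdomain in the cumulative load
        owner.append(min(num_nodes - 1, int(midpoint / target)))
        prefix += w
    return owner

def node_loads(weights, owner, num_nodes):
    loads = [0.0] * num_nodes
    for w, n in zip(weights, owner):
        loads[n] += w
    return loads

def rebalance_once(weights, owner, num_nodes):
    """Move one boundary subdomain off the busiest node if that lowers the peak load."""
    loads = node_loads(weights, owner, num_nodes)
    busy = max(range(num_nodes), key=loads.__getitem__)
    mine = [i for i, n in enumerate(owner) if n == busy]
    if len(mine) < 2:
        return False
    candidates = []
    if busy > 0:
        candidates.append((mine[0], busy - 1))    # hand first subdomain to the left neighbour
    if busy < num_nodes - 1:
        candidates.append((mine[-1], busy + 1))   # hand last subdomain to the right neighbour
    best = None
    for i, dest in candidates:
        new_peak = max(loads[busy] - weights[i], loads[dest] + weights[i])
        if new_peak < loads[busy] and (best is None or new_peak < best[0]):
            best = (new_peak, i, dest)
    if best is None:
        return False
    owner[best[1]] = best[2]
    return True

weights = [1.0] * 8                      # eight equally expensive subdomains
owner = assign_contiguous(weights, 4)    # two subdomains per node
initial_owner = list(owner)

weights[0] = weights[1] = 3.0            # hypothetical cost increase on node 0
peak_before = max(node_loads(weights, owner, 4))
moved = rebalance_once(weights, owner, 4)
peak_after = max(node_loads(weights, owner, 4))
print(initial_owner, owner, peak_before, peak_after)
```

Keeping several subdomains per node is what makes the reassignment possible: a node that owns a single subdomain has nothing to give away.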

The  pressure-projection  approach  requires  the  solution  of  a  linear  system  (Poisson  equation)  for 
the  pressure  in  each  time  step,  even  in  the  case  of  explicit  time  steps.  Here,  we  assumed  the 
problem  was  sufficiently  large  that  iterative  solvers  were  required.  An  implicit  time  step  required 
the  solution  of  a  fully  coupled  nonlinear  system  of  equations.  Assuming  a  Newton-based 
approach,  this  involved  computing  a  sequence  of  Jacobian  matrices  and  solving  the  associated 
linear  systems  (which  are  larger  than  for  the  explicit  case).  Hence,  we  created  efficient  linear 
solvers  towards  delivering  high  performance  in  CFD  simulations. 

Each iteration of a linear solver involves one sparse matrix-vector product (matvec) and
preconditioner-vector product (precvec, often with characteristics similar to those of the sparse
matrix-vector product), multiple dot products, and multiple vector updates. (Note: The pattern of
computation and communication captured by this linear solver is referred to as the sparse linear
algebra motif or dwarf.)
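
As a concrete instance of this per-iteration pattern, the sketch below implements a preconditioned conjugate gradient (PCG) solver in plain Python. PCG and the Jacobi (diagonal) preconditioner are chosen here only for illustration; the text does not state which Krylov method or preconditioner the project codes used, and the small 1D Poisson test matrix and all function names are assumptions of this example. Comments mark where the matvec, precvec, dot products, and vector updates occur.

```python
def csr_matvec(indptr, indices, data, x):
    """Sparse matrix-vector product y = A x for A in CSR format."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        s = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            s += data[k] * x[indices[k]]
        y[row] = s
    return y

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcg(indptr, indices, data, inv_diag, b, tol=1e-10, max_iter=200):
    """Jacobi-preconditioned CG for a symmetric positive definite matrix."""
    x = [0.0] * len(b)
    r = list(b)
    z = [d * ri for d, ri in zip(inv_diag, r)]            # precvec
    p = list(z)
    rz = dot(r, z)                                        # dot product
    b_norm = dot(b, b) ** 0.5
    for it in range(1, max_iter + 1):
        q = csr_matvec(indptr, indices, data, p)          # matvec
        alpha = rz / dot(p, q)                            # dot product
        x = [xi + alpha * pi for xi, pi in zip(x, p)]     # vector update
        r = [ri - alpha * qi for ri, qi in zip(r, q)]     # vector update
        if dot(r, r) ** 0.5 <= tol * b_norm:              # dot product
            return x, it
        z = [d * ri for d, ri in zip(inv_diag, r)]        # precvec
        rz_new = dot(r, z)                                # dot product
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]  # vector update
        rz = rz_new
    return x, max_iter

def poisson_1d_csr(n):
    """Tridiagonal (-1, 2, -1) matrix in CSR format."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data

n = 5
indptr, indices, data = poisson_1d_csr(n)
b = csr_matvec(indptr, indices, data, [1.0] * n)   # right-hand side whose solution is all ones
x, iters = pcg(indptr, indices, data, [0.5] * n, b)
print(x, iters)
```

In a distributed setting, each dot product implies a global reduction, which is why the number of dot products per iteration matters for communication and synchronization cost.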

The  main  performance  issues  involved  (1)  reducing  the  cost  of  communication  and 
synchronization  between  nodes,  (2)  load  balancing  (both  computation  and  communication) 
between  nodes  and  between  the  processors  on  a  node,  (3)  keeping  the  number  of  iterations  low 
while using an efficient parallel preconditioner, and (4) achieving high (local) performance on
(accelerated) processors for the matvec, precvec, dot product, and vector update. While we achieved
many improvements via modest modifications of standard algorithms, we also considered hardware
choices  that  support  the  efficient  implementation  of  such  modifications,  including  the  reduction 
of  global  communication  and  synchronization  in  inner  products. 

While  the  use  of  unstructured  meshes,  leading  to  irregular  memory  access,  has  little  impact  on 
the  communication  cost  between  nodes,  it  results  in  a  significant  impact  on  the  efficient  use  of 
the  processors  and  caches,  particularly  GPUs.  Hence,  high  performance  (by  itself)  might  favor 
the  choice  of  structured  meshes. 

Both  for  structured  and  unstructured  meshes,  the  sparse  matvec  has  very  low  flops  per  data  fetch, 
making  the  operation  typically  memory  bound.  Various  algorithmic  transformations  can  make 
the  use  of  unstructured  meshes  more  efficient.  Creating  many  relatively  small  subdomains  and 
(re)ordering  unknowns  by  subdomain  led  to  better  locality.  In  addition,  using  appropriate  matrix 
data  structures  (e.g.,  Cray/Ellpack  storage)  and  explicit  prefetching  of  data  significantly 
improved  performance  for  general  sparse  matvecs.  The  (vectorized)  matvec  for  structured 
meshes  can  be  implemented  efficiently  by  diagonals,  possibly  with  some  blocking.  Improving 
the low number of flops per data item is more complicated, but it is possible through various
algorithmic modifications, such as combining S subsequent iterations (S times the work, but
most matrix elements need to be fetched only once).
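
To illustrate the role of the matrix data structure, the sketch below shows a matvec in ELLPACK storage, one of the formats named above. Every row is padded to the same number of slots, so the inner loop has a fixed length and, on a GPU, consecutive threads (one per row) can read consecutive entries. The helper names and the small test matrix are assumptions of this example, and plain Python lists stand in for device arrays.

```python
def to_ellpack(rows, pad_col=0):
    """Convert rows given as lists of (column, value) pairs into padded ELLPACK arrays."""
    width = max(len(r) for r in rows)
    cols = [[c for c, _ in r] + [pad_col] * (width - len(r)) for r in rows]
    vals = [[v for _, v in r] + [0.0] * (width - len(r)) for r in rows]
    return cols, vals, width

def ell_matvec(cols, vals, width, x):
    """y = A x with A in ELLPACK format; padded slots hold 0.0 and contribute nothing."""
    n = len(cols)
    y = [0.0] * n
    for k in range(width):      # fixed-length loop over slots
        for i in range(n):      # on a GPU, this loop would map to one thread per row
            y[i] += vals[i][k] * x[cols[i][k]]
    return y

# 3x3 tridiagonal (-1, 2, -1) matrix; the first and last rows are padded to width 3.
rows = [
    [(0, 2.0), (1, -1.0)],
    [(0, -1.0), (1, 2.0), (2, -1.0)],
    [(1, -1.0), (2, 2.0)],
]
cols, vals, width = to_ellpack(rows)
y = ell_matvec(cols, vals, width, [1.0, 2.0, 3.0])
print(y)
```

The padding costs extra storage and wasted multiplications when row lengths vary widely, which is one reason unstructured meshes need more care than structured ones.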

Finally,  we  produced  a  range  of  solver  improvements,  such  as  Krylov  subspace  recycling  and 
updating  preconditioners,  that  significantly  reduced  the  number  of  iterations  while  introducing 
modest  overhead  that,  in  addition,  had  better  performance  characteristics  on  parallel,  multi-core 
machines.  For  example,  the  overhead  in  Krylov  subspace  recycling  involves  a  matvec  with 
multiple  vectors  at  once,  and  the  orthogonalization  of  (a  group  of)  vectors  against  a  group  of 
orthogonal  vectors.  Krylov  subspace  recycling  will  be  particularly  relevant  to  the  fully  implicit 
approach  requiring  the  solution  of  large  nonlinear  systems  of  equations. 
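
The orthogonalization overhead mentioned above can be sketched as a block projection: a group of vectors W is made orthogonal to an existing group of orthonormal vectors U by computing all inner products first and then subtracting the projections. This is a generic block Gram-Schmidt step written for illustration, not the project's recycling implementation; the function names are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonalize_block(U, W):
    """Remove from each vector in W its components along the orthonormal vectors in U."""
    # All inner products are formed before any update, so in a distributed
    # setting they could be combined into a single global reduction.
    coeffs = [[dot(u, w) for u in U] for w in W]
    result = []
    for w, c in zip(W, coeffs):
        result.append([wi - sum(cj * u[i] for cj, u in zip(c, U))
                       for i, wi in enumerate(w)])
    return result

U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # orthonormal group
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # group to be orthogonalized
W_perp = orthogonalize_block(U, W)
print(W_perp)
```

Working on groups of vectors turns many small vector operations into a few larger, matrix-like operations, which is the property that suits parallel, multi-core machines.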

The  relevant  motifs  above  are  sparse  linear  algebra,  structured  or  unstructured  grids,  and  dense 
linear  algebra  for  small  (relative  to  the  total  problem)  local  problems.  Co-designing  motifs, 
supported  by  the  run-time  system,  and  modified  solvers  led  to  significant  performance 
improvements  and  easier  programming  at  the  same  time. 

Some  of  the  algorithmic  modifications  above  that  influenced  hardware  choices  included 
combining  multiple  iterations  of  the  linear  solvers,  Krylov  subspace  recycling,  and  special 
implementations  of  the  sparse  matvec  for  matrices  from  unstructured  meshes.  We  considered 
hardware  that  supports  loading  multiple  (sub)vectors  and  efficient  multiplication  by  a  single 
(sub)matrix,  can  store/cache  relatively  large  amounts  of  data,  has  relatively  high  bandwidth  to 
cache or memory, and has high-bandwidth access to the accelerator. In turn, we optimized our
algorithms  to  improve  flops  per  data  item  fetched  and  favored  algorithms  and  methods  that 
allowed  modifications  that  increased  this  ratio  without  introducing  too  much  overhead. 
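
A rough estimate shows why multiplying several vectors by a single matrix raises the flops-per-data ratio. The model below is an assumption-laden sketch: it assumes CSR storage with 8-byte values and 4-byte indices, counts two flops per nonzero per vector, assumes the matrix is read from memory once regardless of the number of vectors, and assumes each input and output vector is moved exactly once. The function name and the example sizes (a hypothetical 7-point stencil on one million unknowns) are illustrative.

```python
def spmv_intensity(n, nnz, k, value_bytes=8, index_bytes=4):
    """Estimated flops per byte for a CSR matvec applied to k vectors at once."""
    flops = 2 * nnz * k                                    # one multiply and one add per nonzero per vector
    matrix_bytes = nnz * (value_bytes + index_bytes) + (n + 1) * index_bytes
    vector_bytes = k * 2 * n * value_bytes                 # read k inputs, write k outputs
    return flops / (matrix_bytes + vector_bytes)

n, nnz = 10**6, 7 * 10**6
single = spmv_intensity(n, nnz, 1)
grouped = spmv_intensity(n, nnz, 4)
print(f"1 vector:  {single:.3f} flops/byte")
print(f"4 vectors: {grouped:.3f} flops/byte")
```

Under these assumptions, the ratio rises with the number of vectors because the matrix traffic is amortized, but it stays bounded by the vector traffic, which is consistent with the matvec remaining memory bound.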

Important co-design parameters included (1) the size and ordering of subdomains/patches; (2) the
size of the subdomain buffer (ghost) space (for message aggregation and longer local
computations); (3) block sizes in algorithms and underlying data structures, tuned to local/fast
memories (blocks can correspond to finite-element method (FEM) elements but can be smaller
if needed); and (4) the number of concurrent matvecs (possibly with submatrices).

In  all,  the  research  from  this  three-year  grant  resulted  in  more  than  50  publications.  A  brief 
listing  of  our  research  innovations  is  provided  below. 

•  Evolution  of  four  distinctive  GPU-accelerated  CFD  codes:  GENIDLEST,  SENSEI, 
RDGFLOW,  and  INCOMP3D,  of  which  two  achieved  an  order-of-magnitude  speed-up. 

•  Detailed  characterization  of  the  programmability  and  performance  challenges  faced  by 
each  of  the  four  CFD  codes  on  a  GPU,  e.g., 

o  Memory  boundedness  and  limited  memory 
o  Inter-cell  and  intra-cell  dependency 
o  Excessive  branching 
o  Mixed  computational  granularity 

o  Optimization  of  common  computational  and  communication  idioms,  i.e.,  dwarfs 
or  motifs,  such  as  ghost-cell  packing 

o  Co-design  of  hardware,  software,  and  algorithm  simultaneously,  e.g.,  how  to 
design  a  linear  solver  algorithm  and  realize  it  in  software  while  keeping  in  mind 
the strengths of the underlying hardware architecture. One example is a
block-sparse linear solver.

•  Characterization of performance relative to programming language, e.g., CUDA vs.
OpenACC. For this particular set of CFD codes, OpenACC codes took up to twice as long to run
as their CUDA equivalents.

•  Rigorous verification of the correctness of GPU-accelerated CFD codes.

•  Maturation of newly researched and developed numerical methods from 2014 for many-core
GPU environments, e.g., minimizing execution time via fewer expensive iterations
versus more cheap iterations, GPU-parallelized Rosenbrock-Krylov method, and
integrated co-design of CUDA-parallelized solvers with CFD.

•  New numerical methods, including block-orthogonal Rosenbrock-Krylov (ROK) /
Exponential  Krylov  (EXPK)  methods,  linearly  implicit  Runge-Kutta  W  (LIRK-W),  and 
implicit-explicit  general  linear  methods. 

•  Realization  of  a  novel  memory  analysis  tool  that  can  be  used  to  project  the  efficacy  of 
porting  a  code  from  CPU  to  GPU. 

•  Abstraction  of  additional  common  data  structures  and  algorithmic  dwarfs  (or  motifs)  and 
their  optimization  and  incorporation  into  an  accelerator-based  library  framework  called 
MetaMorph,  which  for  the  first  time  anywhere  enables  different  accelerator  devices  to 
interoperate,  e.g.,  a  program  simultaneously  run  on  an  AMD  GPU,  NVIDIA  GPU,  and 
Intel  Xeon  Phi. 

•  Creation  of  a  prototypical  runtime  system  called  CoreTSAR,  short  for  Core  Task-Size 
Adapting  Runtime,  that  automatically  distributes  tasks  across  CPUs  and  GPUs  to  execute 
simultaneously. (This is in contrast to the bulk-synchronous parallel style of execution
that  has  been  traditionally  adopted  by  the  high-performance  computing  community, 
where  execution  alternates  between  the  CPU  and  GPU.) 
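
Ghost-cell packing, named in the list above as a common communication idiom, can be sketched as follows. A boundary column of a row-major 2D field is not contiguous in memory, so it is gathered into a contiguous buffer before being sent and scattered into the neighbour's ghost column on receipt. This is a generic illustration, not code from the four CFD codes or from MetaMorph: the function names, the grid layout (one ghost column on each side), and the in-process "exchange" between two lists are all assumptions.

```python
def pack_column(field, nx, ny, col):
    """Gather one column of a flat, row-major ny-by-nx field into a contiguous buffer."""
    return [field[j * nx + col] for j in range(ny)]

def unpack_column(field, nx, ny, col, buf):
    """Scatter a contiguous buffer into one column of a flat, row-major field."""
    for j in range(ny):
        field[j * nx + col] = buf[j]

nx, ny = 4, 3   # columns 0 and nx-1 are ghost columns; columns 1..nx-2 are interior
left = [float(10 * j + i) for j in range(ny) for i in range(nx)]
right = [float(100 + 10 * j + i) for j in range(ny) for i in range(nx)]

# Left subdomain sends its last interior column; right sends its first interior column.
to_right = pack_column(left, nx, ny, nx - 2)
to_left = pack_column(right, nx, ny, 1)

# In a real code the buffers would travel over the network or across the PCIe bus here.
unpack_column(right, nx, ny, 0, to_right)
unpack_column(left, nx, ny, nx - 1, to_left)

print(to_right, to_left)
```

Packing turns many small strided accesses into one contiguous transfer, which matters most when the data must cross from GPU memory to the host or to another node.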


DISTRIBUTION  A:  Distribution  approved  for  public  release. 


AFOSR Deliverables Submission Survey


Response ID: 6911 Data

1. 

Report  Type 
Final  Report 
Primary  Contact  Email 

Contact  email  if  there  is  a  problem  with  the  report. 

wfeng@vt.edu 

Primary  Contact  Phone  Number 

Contact  phone  number  if  there  is  a  problem  with  the  report 

+1-540-231-1192 

Organization  /  Institution  name 

Virginia  Tech 

Grant/Contract  Title 

The  full  title  of  the  funded  effort. 

AFOSR BRI: Co-Design of Hardware/Software for Predicting MAV Aerodynamics

Grant/Contract  Number 

AFOSR  assigned  control  number.  It  must  begin  with  "FA9550"  or  "F49620"  or  "FA2386". 

FA9550-12-1-0442

Principal  Investigator  Name 

The  full  name  of  the  principal  investigator  on  the  grant  or  contract. 

Wu-chun  Feng 
Program  Officer 

The  AFOSR  Program  Officer  currently  assigned  to  the  award 
Jean-Luc  Cambier 
Reporting  Period  Start  Date 

09/01/2012 

Reporting  Period  End  Date 

10/31/2015 

Abstract 

While  Moore's  Law  theoretically  doubles  processor  performance  every  24  months,  much  of  the  realizable 
performance  remains  untapped  because  the  burden  falls  to  the  (less  informed)  domain  scientist  or 
engineer  to  exploit  parallel  hardware  for  performance  gains.  Even  when  such  untapped  hardware  potential 
is  fully  realized,  it  is  often  not  coupled  with  advances  in  algorithmic  innovation,  which  can  deliver  further 
(multiplicative)  speed-up  beyond  Moore's  Law,  as  noted  in  the  AFOSR  BAA.  In  this  project,  we  propose  a 
formal  co-design  process  for  the  structured  grid  and  unstructured  grid  motifs  found  in  computational  fluid 
dynamics  (CFD)  in  support  of  aerodynamic  predictions  for  micro-air  vehicles  (MAVs).  While  many  past 
efforts  to  develop  such  CFD  codes  on  accelerated  processors  showed  limited  success,  our 
hardware/software  co-design  approach  created  malleable  algorithms  that  could  be  mapped  and  optimized 
onto  the  right  type  of  processing  core  at  the  right  time,  and  in  turn,  deliver  an  order-of-magnitude  better 
performance  than  would  have  otherwise  been  possible  by  Moore's  Law  alone. 